@SteNicholas
Member

@SteNicholas SteNicholas commented Dec 28, 2022

Change Logs

The earliest commit that CleanPlanner retains must not be later than the earliest pending commit. Meanwhile, HoodieTimelineArchiver should retain clustering commits: a clustering instant must not be archived until we are sure that its replaced files have been cleaned. Without the replaced files' metadata on the timeline, the file system view would expose duplicates to readers.
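
The retention rule described above can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not the code merged in this pull request: instants are modelled as fixed-width timestamp strings sorted in ascending order, and the class and method names (`CleanRetentionSketch`, `earliestCommitToRetain`) are hypothetical and not part of Hudi.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for the CleanPlanner rule; not Hudi code.
final class CleanRetentionSketch {

    /**
     * @param completedCommits completed instant times, ascending, fixed width (assumption)
     * @param earliestPending  earliest pending instant time, if any
     * @param commitsRetained  number of latest commits the cleaner should keep
     */
    static Optional<String> earliestCommitToRetain(
            List<String> completedCommits, Optional<String> earliestPending, int commitsRetained) {
        if (commitsRetained <= 0 || completedCommits.size() <= commitsRetained) {
            return Optional.empty(); // nothing old enough to clean
        }
        // With 15 instants and 10 commits retained, this is index 5, i.e. the 6th instant.
        String nth = completedCommits.get(completedCommits.size() - commitsRetained);
        if (!earliestPending.isPresent() || nth.compareTo(earliestPending.get()) <= 0) {
            return Optional.of(nth);
        }
        // The candidate is later than the earliest pending commit:
        // fall back to the last completed commit before that pending commit.
        String fallback = null;
        for (String commit : completedCommits) {
            if (commit.compareTo(earliestPending.get()) < 0) {
                fallback = commit;
            }
        }
        return Optional.ofNullable(fallback);
    }
}
```

For example, with completed commits `001, 002, 003, 005, 006, 007`, three commits retained, and a pending commit at `004`, the sketch retains from `003` rather than `005`, so files still needed by the pending commit are not cleaned.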

Impact

CleanPlanner's earliest commit to retain will never be later than the earliest pending commit, and HoodieTimelineArchiver keeps a clustering commit on the active timeline until its replaced files have been cleaned.

Risk level (write none, low medium or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@leesf leesf self-assigned this Dec 28, 2022
@leesf
Contributor

leesf commented Dec 28, 2022

cc @stream2000

@SteNicholas
Member Author

@danny0405, @zhuanshenbsj1, I have created this pull request to fix the problem mentioned in #7405, whose implementation is a little complex. PTAL.

@yihua yihua self-assigned this Dec 29, 2022
@yihua yihua self-requested a review December 29, 2022 02:38
@yihua yihua added priority:critical Production degraded; pipelines stalled area:table-service Table services labels Dec 29, 2022
@zhuanshenbsj1
Contributor

@danny0405, @zhuanshenbsj1, I have created this pull request to fix the problem mentioned in #7405, whose implementation is a little complex. PTAL.

LGTM. Cleaning clustering instants sequentially, so that the timeline has no clean holes (as with archiving), seems much easier to implement.

@SteNicholas
Member Author

SteNicholas commented Dec 29, 2022

@leesf, could you please review this pull request? I have addressed the above comments from @danny0405.

@SteNicholas
Member Author

@yihua, could you please review this pull request? @leesf has approved these changes.

@yihua
Contributor

yihua commented Jan 1, 2023

@SteNicholas Sorry for the delay. I'll review the PR soon.

@SteNicholas
Member Author

@yihua, any comments on this pull request?

Contributor

@yihua yihua left a comment

The logic to determine the earliest instant to archive wrt clustering seems to be okay but it is very complex now. It would be good to think about how to simplify this as a whole.
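
To make the archiving constraint under discussion concrete, here is a deliberately simplified sketch. It is an assumption-laden illustration, not the logic in `ClusteringUtils` or HoodieTimelineArchiver: it assumes that a completed clustering (replace) instant is safe to archive only if it is earlier than the earliest commit retained by the last clean, and that instant times are fixed-width strings in ascending order. The names `ArchiveBoundarySketch`, `oldestClusteringInstantToRetain`, and `archivable` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified model of the archiving boundary for clustering; not Hudi code.
final class ArchiveBoundarySketch {

    /**
     * Oldest completed clustering instant whose replaced files may not have been cleaned yet.
     * If no clean has run (empty boundary), the first clustering instant must be retained.
     */
    static Optional<String> oldestClusteringInstantToRetain(
            List<String> completedReplaceCommits, Optional<String> cleanerEarliestCommitToRetain) {
        for (String replaceCommit : completedReplaceCommits) {
            if (!cleanerEarliestCommitToRetain.isPresent()
                    || replaceCommit.compareTo(cleanerEarliestCommitToRetain.get()) >= 0) {
                return Optional.of(replaceCommit);
            }
        }
        return Optional.empty();
    }

    /** Keeps only the instants strictly earlier than the clustering instant to retain. */
    static List<String> archivable(List<String> candidates, Optional<String> oldestClusteringToRetain) {
        List<String> result = new ArrayList<>();
        for (String instant : candidates) {
            if (!oldestClusteringToRetain.isPresent()
                    || instant.compareTo(oldestClusteringToRetain.get()) < 0) {
                result.add(instant);
            }
        }
        return result;
    }
}
```

In this model the archiver never moves past a clustering instant whose replaced files might still exist, which is the property that prevents the file system view from exposing duplicates.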

@yihua
Contributor

yihua commented Jan 3, 2023

HUDI-5493 for revisiting the logic.

@yihua
Contributor

yihua commented Jan 4, 2023

@SteNicholas could you check the CI failure?

@SteNicholas
Member Author

@yihua, I have already fixed the CI failure and updated the ClusteringUtils. Meanwhile the current failure has nothing to do with this change. PTAL.

@SteNicholas
Member Author

@hudi-bot run azure

@hudi-bot
Collaborator

hudi-bot commented Jan 6, 2023

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@leesf leesf merged commit f745e64 into apache:master Jan 6, 2023
XuQianJin-Stars pushed a commit that referenced this pull request Feb 11, 2023
…han earliest pending commit (#7568)

(cherry picked from commit f745e64)
nsivabalan pushed a commit to nsivabalan/hudi that referenced this pull request Mar 22, 2023
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
if (config.getCleanerPolicy() == HoodieCleaningPolicy.KEEP_LATEST_COMMITS
    && commitTimeline.countInstants() > commitsRetained) {
  earliestCommitToRetain = commitTimeline.nthInstant(commitTimeline.countInstants() - commitsRetained); // 15 instants total, 10 commits to retain: this gives the 6th instant in the list
  Option<HoodieInstant> earliestPendingCommits = hoodieTable.getMetaClient()
Contributor

Minor: use the singular for "earliestPendingCommits".

// Earliest commit to retain must not be later than the earliest pending commit
earliestCommitToRetain =
    commitTimeline.nthInstant(commitTimeline.countInstants() - commitsRetained).map(nthInstant -> {
      if (nthInstant.compareTo(earliestPendingCommits.get()) <= 0) {
Contributor

We should use the timestamp comparison APIs in HoodieTimeline:

HoodieTimeline.compareTimestamps(commit1Ts, GREATER_THAN, commit2Ts)
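
To illustrate the call shape suggested in this review comment without depending on Hudi, here is a small stand-in. Everything in it is a local assumption: `TimestampCompareSketch`, its `compareTimestamps` method, and its predicate constants are hypothetical helpers that only mimic the `(timestamp, predicate, timestamp)` form shown above; the real helper and its constants should be checked against the HoodieTimeline source.

```java
import java.util.function.BiPredicate;

// Local stand-in that mimics the (ts1, predicate, ts2) call shape; not Hudi's HoodieTimeline.
final class TimestampCompareSketch {

    // Predicates over fixed-width instant time strings (assumption: lexicographic order == time order).
    static final BiPredicate<String, String> GREATER_THAN = (a, b) -> a.compareTo(b) > 0;
    static final BiPredicate<String, String> LESSER_THAN_OR_EQUALS = (a, b) -> a.compareTo(b) <= 0;

    static boolean compareTimestamps(String ts1, BiPredicate<String, String> predicate, String ts2) {
        return predicate.test(ts1, ts2);
    }

    /** The "nth instant is not later than the earliest pending commit" check, on timestamps only. */
    static boolean notLaterThanPending(String nthInstantTs, String earliestPendingTs) {
        return compareTimestamps(nthInstantTs, LESSER_THAN_OR_EQUALS, earliestPendingTs);
    }
}
```

Comparing timestamps explicitly, rather than whole instant objects, keeps the check independent of other instant fields such as action or state, which is presumably the reviewer's concern.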

Labels

area:table-service Table services priority:critical Production degraded; pipelines stalled

8 participants